Training and prediction with scikit-learn

This notebook demonstrates how to use AI Platform to train a simple classification model using scikit-learn, and then deploy the model to get predictions.

You train the model to predict a person's income level based on the Census Income data set.

Before you jump in, let’s cover some of the different tools you’ll be using:

  • AI Platform is a managed service that enables you to easily build machine learning models that work on any type of data, of any size.

  • Cloud Storage is a unified object storage for developers and enterprises, from live data serving to data analytics/ML to data archiving.

  • Cloud SDK is a command-line tool that allows you to interact with Google Cloud products. This notebook introduces several gcloud and gsutil commands, which are part of the Cloud SDK. Note that shell commands in a notebook must be prefixed with a !.
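
For example, the following cell runs a read-only Cloud SDK command that shows the account and project the notebook is currently configured to use. It assumes the SDK is already authenticated in this environment:


In [ ]:
# Read-only sanity check: show the active gcloud account and project
!gcloud config list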

Set up your environment

Enable the required APIs

In order to use AI Platform, confirm that the required APIs are enabled:


In [ ]:
!gcloud services enable ml.googleapis.com
!gcloud services enable compute.googleapis.com

Create a storage bucket

Buckets are the basic containers that hold your data. Everything that you store in Cloud Storage must be contained in a bucket. You can use buckets to organize your data and control access to your data.

Start by defining a globally unique name for your bucket.

For more information about naming buckets, see Bucket name requirements.


In [ ]:
BUCKET_NAME = 'your-new-bucket'

In the examples below, the BUCKET_NAME variable is referenced in the commands using $.
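
Because bucket names must be globally unique, a common alternative to hard-coding a name is to derive it from your project ID. The cell below is a sketch of that pattern; PROJECT_ID is a placeholder you would replace with your own project:


In [ ]:
# Sketch: derive a likely-unique bucket name from a project ID.
# PROJECT_ID is a placeholder; replace it with your actual project ID.
PROJECT_ID = "your-project-id"
BUCKET_NAME = "{}-census-sklearn".format(PROJECT_ID)
print(BUCKET_NAME)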

Create the new bucket with the gsutil mb command:


In [ ]:
!gsutil mb gs://$BUCKET_NAME/
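
If you would rather create the bucket in a specific location (for example, the same region used later for training), gsutil mb accepts a location flag. This is an alternative to the command above, not an additional step; the region shown is an assumption:


In [ ]:
# Alternative: create the bucket in an explicit location.
# us-central1 is an assumption; use the region you plan to train in.
!gsutil mb -l us-central1 gs://$BUCKET_NAME/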

About the data

The Census Income Data Set that this sample uses for training is provided by the UC Irvine Machine Learning Repository.

Census data courtesy of: Lichman, M. (2013). UCI Machine Learning Repository http://archive.ics.uci.edu/ml. Irvine, CA: University of California, School of Information and Computer Science. This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source - http://archive.ics.uci.edu/ml - and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.

The data used in this tutorial is located in a public Cloud Storage bucket:

gs://cloud-samples-data/ml-engine/sklearn/census_data/ 

The training file is adult.data and the evaluation file is adult.test; both can be downloaded from the bucket above. The evaluation file is not used in this tutorial.
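
If you want a quick look at the raw rows before training, you can stream the first few lines of the training file directly from the public bucket. This is a read-only peek; nothing is copied into your own bucket:


In [ ]:
# Print the first five rows of the public training file
!gsutil cat gs://cloud-samples-data/ml-engine/sklearn/census_data/adult.data | head -5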

Create training application package

The easiest (and recommended) way to create a training application package is to use gcloud to package and upload the application when you submit your training job. This method allows you to create a very simple file structure with only two files. For this tutorial, the file structure of your training application package should appear similar to the following:

census_training/
    __init__.py
    train.py

Create a directory locally:


In [ ]:
!mkdir census_training

Create a blank file named __init__.py:


In [ ]:
!touch ./census_training/__init__.py

Save training code in one Python file in the census_training directory. The following cell writes a training file to the census_training directory. The training file performs the following operations:

  • Loads the data into a pandas DataFrame that can be used by scikit-learn
  • Fits the model against the training data
  • Exports the model with the Python pickle library

The following model training code is not executed within this notebook. Instead, it is saved to a Python file and packaged as a Python module that runs on AI Platform after you submit the training job.


In [ ]:
%%writefile ./census_training/train.py
import argparse
import pickle
import pandas as pd

from google.cloud import storage

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelBinarizer

parser = argparse.ArgumentParser()
parser.add_argument("--bucket-name", help="The bucket name", required=True)

arguments, unknown = parser.parse_known_args()
bucket_name = arguments.bucket_name

# Define the format of your input data, including unused columns.
# These are the columns from the census data files.
COLUMNS = (
    'age',
    'workclass',
    'fnlwgt',
    'education',
    'education-num',
    'marital-status',
    'occupation',
    'relationship',
    'race',
    'sex',
    'capital-gain',
    'capital-loss',
    'hours-per-week',
    'native-country',
    'income-level'
)

# Categorical columns are columns that need to be turned into a numerical value
# to be used by scikit-learn
CATEGORICAL_COLUMNS = (
    'workclass',
    'education',
    'marital-status',
    'occupation',
    'relationship',
    'race',
    'sex',
    'native-country'
)

# Create a Cloud Storage client to download the census data
storage_client = storage.Client()

# Download the data
public_bucket = storage_client.bucket('cloud-samples-data')
blob = public_bucket.blob('ml-engine/sklearn/census_data/adult.data')
blob.download_to_filename('adult.data')

# Load the training census dataset
with open("./adult.data", "r") as train_data:
    raw_training_data = pd.read_csv(train_data, header=None, names=COLUMNS)
    # Remove leading/trailing whitespace from the categorical features
    for col in CATEGORICAL_COLUMNS:
        raw_training_data[col] = raw_training_data[col].apply(lambda x: str(x).strip())

# Remove the column we are trying to predict ('income-level') from our features
# list and convert the DataFrame to a list of lists
train_features = raw_training_data.drop("income-level", axis=1).values.tolist()
# Create our training labels list: True when 'income-level' is ' >50K'
train_labels = (raw_training_data["income-level"] == " >50K").values.tolist()

# Since the census data set has categorical features, we need to convert
# them to numerical values. We'll use a list of pipelines to convert each
# categorical column and then use FeatureUnion to combine them before calling
# the RandomForestClassifier.
categorical_pipelines = []

# Each categorical column needs to be extracted individually and converted to a
# numerical value. To do this, each categorical column will use a pipeline that
# extracts one feature column via SelectKBest(k=1) and a LabelBinarizer() to
# convert the categorical value to a numerical one. A scores array (created
# below) will select and extract the feature column. The scores array is
# created by iterating over the columns and checking if it is a
# categorical column.
for i, col in enumerate(COLUMNS[:-1]):
    if col in CATEGORICAL_COLUMNS:
        # Create a scores array to get the individual categorical column.
        # Example:
        #  data = [
        #      39, 'State-gov', 77516, 'Bachelors', 13, 'Never-married',
        #      'Adm-clerical', 'Not-in-family', 'White', 'Male', 2174, 0,
        #      40, 'United-States'
        #  ]
        #  scores = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
        #
        # Returns: [['State-gov']]
        # Build the scores array
        scores = [0] * len(COLUMNS[:-1])
        # This column is the categorical column we want to extract.
        scores[i] = 1
        skb = SelectKBest(k=1)
        skb.scores_ = scores
        # Convert the categorical column to a numerical value
        lbn = LabelBinarizer()
        r = skb.transform(train_features)
        lbn.fit(r)
        # Create the pipeline to extract the categorical feature
        categorical_pipelines.append(
            (
                'categorical-{}'.format(i),
                Pipeline([
                    ('SKB-{}'.format(i), skb),
                    ('LBN-{}'.format(i), lbn)])
            )
        )

# Create pipeline to extract the numerical features
skb = SelectKBest(k=6)
# From COLUMNS, select only the numerical features
skb.scores_ = [1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0]
categorical_pipelines.append(("numerical", skb))

# Combine all the features using FeatureUnion
preprocess = FeatureUnion(categorical_pipelines)

# Create the classifier
classifier = RandomForestClassifier()

# Transform the features and fit them to the classifier
classifier.fit(preprocess.transform(train_features), train_labels)

# Create the overall model as a single pipeline
pipeline = Pipeline([("union", preprocess), ("classifier", classifier)])

# Create the model file
# It is required to name the model file "model.pkl" if you are using pickle
model_filename = "model.pkl"
with open(model_filename, "wb") as model_file:
    pickle.dump(pipeline, model_file)

# Upload the model to Cloud Storage
bucket = storage_client.bucket(bucket_name)
blob = bucket.blob(model_filename)
blob.upload_from_filename(model_filename)

Submit the training job

In this section, you use gcloud ai-platform jobs submit training to submit your training job. The -- argument passed to the command is a separator; anything after the separator will be passed to the Python code as input arguments.

For more information about the arguments preceding the separator, run the following:

    gcloud ai-platform jobs submit training --help

The argument passed to the Python script after the separator is --bucket-name, which specifies the Cloud Storage bucket where the trained model file is saved.
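
Before submitting the job, you can optionally smoke-test the package on the notebook machine with the Cloud SDK's local training helper. This optional step is a sketch, not part of the original flow; it runs the same module locally and still expects a bucket name so the model file can be uploaded:


In [ ]:
# Optional: run the training module locally as a quick sanity check.
# This executes census_training.train on this machine, not on AI Platform.
!gcloud ai-platform local train \
  --package-path ./census_training \
  --module-name census_training.train \
  -- \
  --bucket-name $BUCKET_NAME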


In [ ]:
import time

# Define a timestamped job name
JOB_NAME = "census_training_{}".format(int(time.time()))

In [ ]:
# Submit the training job:
!gcloud ai-platform jobs submit training $JOB_NAME \
  --job-dir gs://$BUCKET_NAME/census_job_dir \
  --package-path ./census_training \
  --module-name census_training.train \
  --region us-central1 \
  --runtime-version=1.12 \
  --python-version=3.5 \
  --scale-tier BASIC \
  --stream-logs \
  -- \
  --bucket-name $BUCKET_NAME
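
The --stream-logs flag keeps the cell running until the job finishes. If you interrupt it, or want to check on the job later, you can query its state; a small sketch:


In [ ]:
# Check the current state of the training job
!gcloud ai-platform jobs describe $JOB_NAME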

Verify model file in Cloud Storage

View the contents of the destination model directory to verify that your model file has been uploaded to Cloud Storage.

Note: The model can take a few minutes to train and show up in Cloud Storage.


In [ ]:
!gsutil ls gs://$BUCKET_NAME/

Serve the model

Once the training job has finished and the model file has been uploaded to Cloud Storage, you can serve it. A model can have multiple versions. To serve the model, create a model resource and a version in AI Platform.

Define the model and version names:


In [ ]:
MODEL_NAME = "CensusPredictor"
VERSION_NAME = "census_predictor_{}".format(int(time.time()))

Create the model in AI Platform:


In [ ]:
!gcloud ai-platform models create $MODEL_NAME --regions us-central1

Create a version that points to your model file in Cloud Storage:


In [ ]:
!gcloud ai-platform versions create $VERSION_NAME \
  --model=$MODEL_NAME \
  --framework=scikit-learn \
  --origin=gs://$BUCKET_NAME/ \
  --python-version=3.5 \
  --runtime-version=1.12
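
Version creation can take a couple of minutes. To confirm that the version was created successfully, you can describe it and inspect its state:


In [ ]:
# Confirm the version exists and inspect its state
!gcloud ai-platform versions describe $VERSION_NAME --model=$MODEL_NAME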

Make predictions

Format data for prediction

Before you send an online prediction request, you must format your test data to prepare it for use by the AI Platform prediction service. Make sure that the format of your input instances matches what your model expects.

Create an input.json file with each input instance on a separate line. The following example uses ten data instances. The Census model expects 14 features per instance, so your input must be a matrix of shape (num_instances, 14).


In [ ]:
# Define a name for the input file
INPUT_FILE = "./census_training/input.json"

In [ ]:
%%writefile $INPUT_FILE
[25, "Private", 226802, "11th", 7, "Never-married", "Machine-op-inspct", "Own-child", "Black", "Male", 0, 0, 40, "United-States"]
[38, "Private", 89814, "HS-grad", 9, "Married-civ-spouse", "Farming-fishing", "Husband", "White", "Male", 0, 0, 50, "United-States"]
[28, "Local-gov", 336951, "Assoc-acdm", 12, "Married-civ-spouse", "Protective-serv", "Husband", "White", "Male", 0, 0, 40, "United-States"]
[44, "Private", 160323, "Some-college", 10, "Married-civ-spouse", "Machine-op-inspct", "Husband", "Black", "Male", 7688, 0, 40, "United-States"]
[18, "?", 103497, "Some-college", 10, "Never-married", "?", "Own-child", "White", "Female", 0, 0, 30, "United-States"]
[34, "Private", 198693, "10th", 6, "Never-married", "Other-service", "Not-in-family", "White", "Male", 0, 0, 30, "United-States"]
[29, "?", 227026, "HS-grad", 9, "Never-married", "?", "Unmarried", "Black", "Male", 0, 0, 40, "United-States"]
[63, "Self-emp-not-inc", 104626, "Prof-school", 15, "Married-civ-spouse", "Prof-specialty", "Husband", "White", "Male", 3103, 0, 32, "United-States"]
[24, "Private", 369667, "Some-college", 10, "Never-married", "Other-service", "Unmarried", "White", "Female", 0, 0, 40, "United-States"]
[55, "Private", 104996, "7th-8th", 4, "Married-civ-spouse", "Craft-repair", "Husband", "White", "Male", 0, 0, 10, "United-States"]

Send the online prediction request

Each prediction is True if the person's income is predicted to be greater than $50,000 per year, and False otherwise. The output of the command below may appear similar to the following:

[False, False, False, True, False, False, False, False, False, False]

In [ ]:
!gcloud ai-platform predict --model $MODEL_NAME \
  --version $VERSION_NAME --json-instances $INPUT_FILE
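
As an alternative to the gcloud command above, you can call the online prediction API from Python with the Google API client library. The cell below is a sketch under a few assumptions: PROJECT_ID is a placeholder for your project, and the googleapiclient package is available in the notebook environment:


In [ ]:
# Sketch: send the same instances through the prediction REST API from Python.
# Assumptions: PROJECT_ID is a placeholder; googleapiclient is installed.
import json
from googleapiclient import discovery

PROJECT_ID = "your-project-id"
name = "projects/{}/models/{}/versions/{}".format(PROJECT_ID, MODEL_NAME, VERSION_NAME)

# Read the newline-delimited JSON instances written earlier
with open(INPUT_FILE) as f:
    instances = [json.loads(line) for line in f if line.strip()]

service = discovery.build("ml", "v1")
response = service.projects().predict(name=name, body={"instances": instances}).execute()
print(response.get("predictions", response))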

Clean up

To delete all resources you created in this tutorial, run the following commands:


In [ ]:
# Delete the model version
!gcloud ai-platform versions delete $VERSION_NAME --model=$MODEL_NAME --quiet

# Delete the model
!gcloud ai-platform models delete $MODEL_NAME --quiet

# Delete the bucket and contents
!gsutil rm -r gs://$BUCKET_NAME
    
# Delete the local files created by the tutorial
!rm -rf census_training